0.1 Brief Summary of Data And Data Source


This is a personal project. The motivation for this project is to learn new bioinformatics skills related to viral research and expand my knowledge. This project analyzes CMV (Human betaherpesvirus 5) genomic data related to congenital infections. Data analysis comprises finding drug resistance mutations, CMV genotyping, and phylogenetic analysis of samples.

Raw sequence data used in this project has been previously shared in NCBI-SRA by the University of Glasgow (HCMV_Congenital_Collection) under the BioProject Accession Number PRJEB48078 (https://www.ncbi.nlm.nih.gov/bioproject/PRJEB48078). This BioProject contains 3 Genomic DNA submitted by Salvatore Camiolo (University of Glasgow, School of Infection & Immunity). Instrument used was Illumina MiSeq in PRJEB48078. And researchers used hybrid selection as a selection method.

All bash and R-scripts and python notebooks related to the current project have been shared in the GitHub repository https://github.com/osmanmerdan/Congenital-CMV. Tool parameters have been explained in detail in scripts files.

This project does not aim to find optimal best practices to analyze CMV NGS data. For readers who are interested in a user-friendly, fast, all-in-one solution to analyze the CMV genome, please check GRACy(Camiolo et al. (2021)). This HTML document was created using RMarkdown. Plots were created using the ggplot package if otherwise are not indicated.

0.2 CMV Viral Particle


Human cytomegalovirus formally known as Human Herpes Virus 5 (HHV5) is a member of Herpesviridae family. HHV-6, HHV-7, and CMV are classified as betaherpesviruses.

Complete HCMV particles have a diameter of 120-200 nm. HCMV nucleocapsid encircles large (220- to 240-kb) linear double-stranded DNA genome. Outside of nucleocapsid there is matrix and a surrounding phospholipid-rich envelope.

The HCMV genome is the largest among herpesviruses. It consists of more than 170 nonoverlapping ORFs (Open Reading Frames). Genome encodes structural, regulatory and immune modulatory proteins.

CMV is one of most important pathogenes causing congenital infections ultimately leading sensorineural hearing loss and neurodevelopmental problems.

HCMV DNA enters the host cell in linear form. HCMV genome has an unpaired base at each end which facilitates circularization of the genome which enables rolling circle replication. This replication mechanism creates genome length covalently linked copies of viral genome which are called concatemer. Concatemers are later cleved.

HCMV genome has class E organization with two domains that can be inverted relative to each other (Fig 1). That organization yields equal amounts of four genomic isomers that can be isolated from samples. These two domains are the long and short genome segments (L and S). Each domain contains a unique central region (UL and US). Those regions are flanked by repeated components that reside at either the terminal regions of the genome (TRL and TRS) or intersection of L and S segments (IRL and IRS). General organization of HCMV genome is TRL-UL-IRL -IRS-US-TRS. Terminal end internal repeats shares a region of couple of hundreds bps, called a sequence.

Fig 1: CMV wild type Merlin strain structure and possible isomers

Fig 1: CMV wild type Merlin strain structure and possible isomers

0.3 Quality Control Of Trimmed and Filtered NGS Data


Paired end seqeunce data of 36 samples were downloaded from SRA using sratoolkit (3.0.0) fastq-dump. Reads were timmed with Trimmomatic (Bolger, Lohse, and Usadel (2014)) (version 0.39). The minimum sequence length was set to 200 bp. After Trimmomatic operations, FastQC (“FastQC” (2015)) and MultiQC (Ewels et al. (2016)) were performed to get summary statistics about raw-filtered NGS data.

The below table shows some of the summary statistics.



Plot 1: The number of unique and duplicated reads in each trimmed sequencing record, estimated by FastQC.

Plot 1: The number of unique and duplicated reads in each trimmed sequencing record, estimated by FastQC.



HCMV wild-type strain Merlin genome size is 235646 bp. To achieve reliable results from the variant analysis, at least a 20 bp coverage threshold has been set in present NGS data analysis project. That means at least 20000 (200bp long) reads are needed (red line in the Plot 1). Some of the low unique read count samples are:



Fig 2: ERR7018443 (red) and other samples' percent GC contents (gray).

Fig 2: ERR7018443 (red) and other samples’ percent GC contents (gray).



HCMV Merlin strain genome has a GC content of %57. All the samples had some level of deviation from the expected %57, which several issues like contamination, PCR bias, low genome coverage, and hypervariable regions in the CMV genomes may have caused (Fig 2).

Mean quality scores were sufficient enough for downstream analysis (Fig 3).

Fig 3: Base quality

Fig 3: Base quality



0.4 Alignment Results


Reads were aligned to Human herpesvirus five strain Merlin, complete genome (GenBank Accession: AY446894.2) using BWA-MEM (Heng Li (2013)) (Version: 0.7.17-r1188). Alignment files were filtered and sorted using Samtools (Danecek et al. (2021)) (Version: 1.16.1). Output bam files were de-duplicated using Picard MarkDuplicates (“Picard Toolkit” (2018)).
Some important summary results from de duplicated-bam alignment files were presented below.



Samples with mean depth lower than 20 or coverage lower than 90% were listed below.



 Plot 2: Empirical cumulative density function (complementary) visualization for low-coverage low-mean-depth samples

Plot 2: Empirical cumulative density function (complementary) visualization for low-coverage low-mean-depth samples



ERR7018474 and ERR7018433 could cover 50% of the reference genome with depth(X)≥1 (Plot 2). Five samples (ERR7018443, ERR7018474, ERR7021883, ERR7029111, ERR7039821) had low coverage and low depth across the genome (Fig 4). A large chunk of the reference genome had depth(X)<20 for these five samples (Plot 2). They were not further processed.

Fig 4: Depth of coverage heat map with log transformed values

Fig 4: Depth of coverage heat map with log transformed values



There were three regions not covered in any sample, in the beginning, at the end, and around the 200000th position (Fig 4). Those regions correspond to TRL, TRS, IRS, and IRL regions in AY446894.2, which are listed below.



However, there were indications of possible deletions around positions 6000 and 11000 (Plot 3). Record ERR7036363 had zero depth in those two regions (Plot 3). ERR7032218 showed signs of deletion only around position 11000. ERR7024258 was included for comparison.

Plot 3 : Positions between 5000-13000 for samples ERR7036363, ERR7032218, ERR7024258.

Plot 3 : Positions between 5000-13000 for samples ERR7036363, ERR7032218, ERR7024258.



ERR7036363 showed significant sequence variation in the RL6 gene, as indicated by soft-clipped reads (Fig 5). That’s why ER7036363 had zero reads mapped at corresponding positions. ERR7032218 has some SNPs and a deletion in the RL6 gene (Fig 5). ERR7024258 has nucleotide content very similar to the Merlin strain at corresponding parts (Fig 5).

Fig 5: IGV snapshot zoomed in positions between 5600-6600

Fig 5: IGV snapshot zoomed in positions between 5600-6600



ERR7036363 and ERR7032218 showed significant sequence variation with signs of possible deletion in regions corresponding to the genes RL11, RL12, RL13, and UL1 (Fig 6). In addition, ERR7024258 has many sequence variations compared to the Merlin strain in these gene regions (Fig 6).

Fig 6: IGV snapshot zoomed in positions between 9000-14000

Fig 6: IGV snapshot zoomed in positions between 9000-14000



RL6, RL12, RL13, and UL1 genes are members of the RL11 gene family. These regions are known for showing sequence variations between clinical isolates. (Sijmons, Van Ranst, and Maes (2014), Dolan et al. (2004))

In the Fig4 positions between 180800 and 181300, there is a uniform drop in the the number of reads mapped across the samples. However, some samples had a second drop in the number of reads mapped around position 186800. ERR7025502 and ERR7039855 had zero mapped reads in the region neighboring position 187500 (Plot 4). ERR7025502, ERR7039855 and ERR7019208 had zero reads mapped around position 181000 (Plot 4).

Plot 4: Positions between 180000-190000 for samples ERR7039855, ERR7019208, and ERR7025502

Plot 4: Positions between 180000-190000 for samples ERR7039855, ERR7019208, and ERR7025502



UL146 gene has been shown to be particularly variable.(Dolan et al. (2004)) (Fig 7). ERR7025502, ERR7039855 and ERR7019208 had soft clipped reads in UL146 region indicating large amount of sequence variation between samples and reference Merlin strain (Fig 7).

Fig 7: IGV snapshot zoomed in positions between 180600-181600

Fig 7: IGV snapshot zoomed in positions between 180600-181600



Another region showing sequence variation between different HCMV clinical isolates is the UL139 region (Qi et al. (2006)). The alignment status of reads belonging to ERR7025502, and ERR7039855 showed sequence variation in UL139-150A regions (Fig 8).

Fig 8: IGV snapshot zoomed in the UL139 gene

Fig 8: IGV snapshot zoomed in the UL139 gene



In summary, CMV has a lot of regions that pose high levels of sequence variation between clinical strains. Those regions could be rendered as low-coverage, low-depth regions during alignment process.

0.5 Variants


Variants were called on deduplicated-bam files and filtered using bcftools (version 1.16).(H. Li (2011)). Later SnpEff database for AY446894.2 was build following instructions shared on SnpEff website. (http://pcingola.github.io/SnpEff/se_faq/#how-to-building-an-ncbi-genome-genbank-file) Variants in coding regions were annotated using SnpEff (Cingolani, Platts, et al. (2012)). Annotated variants were extracted with SnpSift (Cingolani, Patel, et al. (2012)). Number of different annotations were summarized below table. Note that beacuse the of overlapping genes and regions below numbers are not exact representation of number of variants. During the variant extraction process each variant extracted seperate rows to enable quick downstream pprocess.



Plot 5: Heatmap summarizing number of non-synonymous SNPs for each sample and gene combination

Plot 5: Heatmap summarizing number of non-synonymous SNPs for each sample and gene combination



Some samples had a high number of non-synonymous SNPs at regions RL12 and RL13 (Plot 5). However, nearly half of the samples had zero non-synonymous SNP in RL12 and RL13 (Plot 5). RL12 and RL13 bind Fc portion of human IgG1 and IgG2.(Cortese et al. (2012)) and they show significant number of sequence variations between clinical strains. Manually inspecting alignments files for the R12 and RL13 genes revealed that some samples (e.g.ERR7018455) had no coverage in the region which lead zero detected sequence variation (Fig 9, Fig 10). Hypervariable regions in HCMV genome leads to complicated alignments which can significantly alter the sequence variation results.

Fig 9: IGV capture for read alignments of the RL12 gene.

Fig 9: IGV capture for read alignments of the RL12 gene.



Fig 10: IGV capture for read alignments of the RL13 gene.

Fig 10: IGV capture for read alignments of the RL13 gene.



Plot 6: Box plots for some genes with more than ten mean-ci of non-synonymous SNPs.

Plot 6: Box plots for some genes with more than ten mean-ci of non-synonymous SNPs.



A higher number of non-synonymous SNPs were observed in UL8, UL74, UL75, and UL150 genes (Plot 5, Plot 6). The products of these genes were summarized in the following table.



UL74, and UL75 are involved in cell entry (Ye et al. (2020)). UL74 and UL75 codes for envelope glycoproteins O and H, respectively. UL74 is related to cell tropism and immunomodulation (Sijmons, Van Ranst, and Maes (2014)). UL74 region is also a hypervariable region. UL150 is involved in immune regulation which can explain high number of non-synonymous SNPs.



Membrane glycoprotein 8 is the product of UL8. Each sample had 30 or more non-synonymous SNPs in the UL8 gene (Plot 6). UL8 is shown to be able to reduce the production of pro-inflammatory cytokines significantly. UL7 and UL8 share 195 amino acids (aa) long sequence, which includes ~35 aa signal peptide and ~100 aa IgV-like domain (Pérez-Carmona et al. (2018)). Previous studies have shown that. HCMV strains has %79-%100 sequence identity and most of the sequence variation in the UL8 gene has been observed in the stalk region which is ~140 aa long. Similar results obtained from the variant analysis (Plot 7). 3’ end of UL8 coding region showed high number of sequence variations (Fig 11).

Fig 11: IGV capture for read alignments of the UL8 and neighbouring genes.

Fig 11: IGV capture for read alignments of the UL8 and neighbouring genes.



0.6 Drug Resistance Mutations


Main drugs used to treat CMV infections are Ganciclovir, Foscarnet, Cidofovir, Letermovir and Maribravir. UL54 (DNA polymerase) mutations can alter Ganciclovir, Foscarnet and Cidofovir succeptibility. UL97 kinase mutations effect Ganciclovir and Maribavir succeptibility. Maribavir and Letermovir resistance can arise because of the mutations in UL27 or Terminase gene (UL56, UL89, UL51) respectively.

Annotated-VCF files were searched for the presence of known antiviral resistance-related mutations using the Hrpesvirus Drug Resistance Genotyping web server (http://cmv-resistance.ucl.ac.uk/herpesdrg/). No antimicrobial resistance mutation has been found in any sample. One exciting feature is that most samples have Gln126Leu and Asp605Glu amino acid polymorphisms in the UL97 translated sequence _(Plot8__. Below table summarizes detected polymorphisms by the server. Asp605Glu polymorphism previously shown to be more prevalant in infants with primary HCMV infection compared with HCMV recurrence in transplant recipients (Sun et al. (2012)).




No previous literature record was found for Thr75Ala amino acid variation in UL97 translated sequence. However, it was present in 31 samples (Plot8). Other mutations in UL97 were already mentioned in the literature and are frequently found in clinical samples. (Lurain and Chou (2010)) Characterized

Plot 8: UL97 gene representation with detected non-synonymous mutations. It is shown in the y-axis how many sampless have mutations. (Created by using https://github.com/jianhong/trackViewer package)

Plot 8: UL97 gene representation with detected non-synonymous mutations. It is shown in the y-axis how many sampless have mutations. (Created by using https://github.com/jianhong/trackViewer package)



No previous literature record was found for Gln339Arg and Thr1122Ile amino acid changes in UL54 translated sequence (Plot 9). Number of samples has two SNPs (HGVS_C notation c.3365C>T, c.3364A>G) (Ref Pos 78558 G->A and Ref Pos 78559 T -> C) affecting codon corresponding to amino acid at position 1122 in UL54 protein (Fig 12 and Fig 13). Those two SNPs were pictured as two different amino acid changes (Thr1122Ile and Thr1122Ala) in the annotated vcf files. But combined effect of two SNPs resulted in a codon change from ACC to GTC which actually caused Thr1122Val. There were also some samples with neither of mutations (Thr1122Ile and Thr1122Ala).
Gln339Arg was found in five samples. Remaining mutations were characterized by other researchers.(Lurain and Chou (2010))

Fig 12: Graphic for Merlin Strain UL54 protein sequence. (From NCBI)

Fig 12: Graphic for Merlin Strain UL54 protein sequence. (From NCBI)



Fig 13: Alignment of reads for sampme ERR7021186. 78558G->A and 78559T -> C mutations present in one sample

Fig 13: Alignment of reads for sampme ERR7021186. 78558G->A and 78559T -> C mutations present in one sample



Plot 9: UL54 gene representation with detected non-synonymous mutations. It is shown in the y-axis how many samples have mutations.(Created by using https://github.com/jianhong/trackViewer package)

Plot 9: UL54 gene representation with detected non-synonymous mutations. It is shown in the y-axis how many samples have mutations.(Created by using https://github.com/jianhong/trackViewer package)



There was no resistance related mutations in UL56, UL51, UL89, UL27 genes. Some missense variations were present in nearly all samples (Plot 10).

Plot 10: Number of samples with detected missense variations in UL56, UL51, UL89, UL27 genes

Plot 10: Number of samples with detected missense variations in UL56, UL51, UL89, UL27 genes



0.7 Genotyping


The stain numbers were estimated following previously describes steps (Suárez et al. (2019), Govender et al. (2022)) using the miRNA_Search program (https://github.com/centre-for-virus-research/VATK/tree/master/HCMV_pipeline). This program searches the presence of short motifs in the sequence data. Previously listed short motifs has been used (Govender et al. (2022), Supplementary Tables). However, to minimise biases arised from sequence duplciates FastUniq used to remove duplicate sequences from sequencing reads (Xu et al. (2012)). At this step FASTQ files was turned into FASTA format and later sequence identifiers removed before processing files with miRNA_Search. Output files were parsed using a modified version of perl script shared in HCMV_pipeline. Minimum number of reads that carry one specific motif was set to 10 and the percetange treshold for number of reads that carry any motif was set to 2%. Any genotype motif that did not meet with this requirements was discarded. Motifs representing RL5A, RL6, UL74, RL12, UL11, UL139, UL111A, UL9, UL146, UL120, UL73, US9, RL13, UL1 regions were searched in the reads.

There was no sample which has more than one strain. Only one genotype has been detected for each sample - variable gene combinations(genotype_counts.txt file). Genes, genotypes, number of reads carry that genotype motif and the percetage of reads were summarized in table below.



RL5A-G1, ULL11A-WT, and US9-WT were common elements in all samples. Two different genotypes were spotted in UL9, UL11, UL1, RL6 regions (Plot11). Other regions were little bit more diverse regarding genotypes.


Plot 11: Counts for each genotype

Plot 11: Counts for each genotype

Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. 2014. “Trimmomatic: A Flexible Trimmer for Illumina Sequence Data.” Bioinformatics 30 (15): 2114–20. https://doi.org/10.1093/bioinformatics/btu170.
Camiolo, Salvatore, Nicolás M Suárez, Antonia Chalka, Cristina Venturini, Judith Breuer, and Andrew J Davison. 2021. GRACy: A Tool for Analysing Human Cytomegalovirus Sequence Data.” Virus Evolution 7 (1): veaa099. https://doi.org/10.1093/ve/veaa099.
Cingolani, P., V. M. Patel, M. Coon, T. Nguyen, S. J. Land, D. M. Ruden, and X. Lu. 2012. “Using Drosophila Melanogaster as a Model for Genotoxic Chemical Mutational Studies with a New Program, SnpSift.” Frontiers in Genetics 3.
Cingolani, P., A. Platts, M. Coon, T. Nguyen, L. Wang, S. J. Land, X. Lu, and D. M. Ruden. 2012. “A Program for Annotating and Predicting the Effects of Single Nucleotide Polymorphisms, SnpEff: SNPs in the Genome of Drosophila Melanogaster Strain W1118; Iso-2; Iso-3.” Fly 6 (2): 80–92.
Cortese, Mirko, Stefano Calò, Romina D’Aurizio, Anders Lilja, Nicola Pacchiani, and Marcello Merola. 2012. “Recombinant Human Cytomegalovirus (HCMV) RL13 Binds Human Immunoglobulin G Fc.” Edited by Michael Nevels. PLoS ONE 7 (11): e50166. https://doi.org/10.1371/journal.pone.0050166.
Danecek, Petr, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, et al. 2021. Twelve years of SAMtools and BCFtools.” GigaScience 10 (2). https://doi.org/10.1093/gigascience/giab008.
Dolan, Aidan, Charles Cunningham, Ralph D. Hector, Aycan F. Hassan-Walker, Lydia Lee, Clare Addison, Derrick J. Dargan, et al. 2004. “Genetic Content of Wild-Type Human Cytomegalovirus.” Journal of General Virology 85 (5): 1301–12. https://doi.org/10.1099/vir.0.79888-0.
Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016. “MultiQC: Summarize Analysis Results for Multiple Tools and Samples in a Single Report.” Bioinformatics 32 (19): 3047. https://doi.org/10.1093/bioinformatics/btw354.
“FastQC.” 2015. https://qubeshub.org/resources/fastqc.
Govender, Kerusha, Raveen Parboosing, Salvatore Camiolo, Petr Hubáček, Irene Görzer, Elisabeth Puchhammer-Stöckl, and Nicolás M. Suárez. 2022. “Complexity of Human Cytomegalovirus Infection in South African HIV-Exposed Infants with Pneumonia.” Viruses 14 (5): 855. https://doi.org/10.3390/v14050855.
Li, H. 2011. “A Statistical Framework for SNP Calling, Mutation Discovery, Association Mapping and Population Genetical Parameter Estimation from Sequencing Data.” Bioinformatics 27 (21): 2987–93. https://doi.org/10.1093/bioinformatics/btr509.
Li, Heng. 2013. “Aligning Sequence Reads, Clone Sequences and Assembly Contigs with BWA-MEM.” https://doi.org/10.48550/ARXIV.1303.3997.
Lurain, Nell S., and Sunwen Chou. 2010. “Antiviral Drug Resistance of Human Cytomegalovirus.” Clinical Microbiology Reviews 23 (4): 689–712. https://doi.org/10.1128/CMR.00009-10.
Pérez-Carmona, Natàlia, Pablo Martínez-Vicente, Domènec Farré, Ildar Gabaev, Martin Messerle, Pablo Engel, and Ana Angulo. 2018. “A Prominent Role of the Human Cytomegalovirus UL8 Glycoprotein in Restraining Proinflammatory Cytokine Production by Myeloid Cells at Late Times During Infection.” Edited by Richard M. Longnecker. Journal of Virology 92 (9): e02229–17. https://doi.org/10.1128/JVI.02229-17.
“Picard Toolkit.” 2018. Broad Institute, GitHub Repository. http://broadinstitute.github.io/picard/; Broad Institute.
Qi, Ying, Zhi-Qin Mao, Qiang Ruan, Rong He, Yan-Ping Ma, Zheng-Rong Sun, Yao-Hua Ji, and Yujing Huang. 2006. “Human Cytomegalovirus (HCMV) UL139 Open Reading Frame: Sequence Variants Are Clustered into Three Major Genotypes.” Journal of Medical Virology 78 (4): 517–22. https://doi.org/10.1002/jmv.20571.
Sijmons, Steven, Marc Van Ranst, and Piet Maes. 2014. “Genomic and Functional Characteristics of Human Cytomegalovirus Revealed by Next-Generation Sequencing.” Viruses 6 (3): 1049–72. https://doi.org/10.3390/v6031049.
Suárez, Nicolás M, Gavin S Wilkie, Elias Hage, Salvatore Camiolo, Marylouisa Holton, Joseph Hughes, Maha Maabar, et al. 2019. “Human Cytomegalovirus Genomes Sequenced Directly From Clinical Material: Variation, Multiple-Strain Infection, Recombination, and Gene Loss.” The Journal of Infectious Diseases 220 (5): 781–91. https://doi.org/10.1093/infdis/jiz208.
Sun, J., G. Cao, L. Zhang, Y. Zhang, Z. Zhao, J. Hu, and Y. Ji. 2012. “Human Cytomegalovirus (CMV) UL97 D605E Mutation Has a Higher Prevalence in Infants With Primary CMV Infection Compared With Transplant Recipients With CMV Recurrence.” Transplantation Proceedings 44 (10): 3022–25. https://doi.org/10.1016/j.transproceed.2012.06.069.
Xu, Haibin, Xiang Luo, Jun Qian, Xiaohui Pang, Jingyuan Song, Guangrui Qian, Jinhui Chen, and Shilin Chen. 2012. FastUniq: A Fast De Novo Duplicates Removal Tool for Paired Short Reads.” Edited by Daniel Doucet. PLoS ONE 7 (12): e52249. https://doi.org/10.1371/journal.pone.0052249.
Ye, Lele, Yunyun Qian, Weijie Yu, Gangqiang Guo, Hong Wang, and Xiangyang Xue. 2020. “Functional Profile of Human Cytomegalovirus Genes and Their Associated Diseases: A Review.” Frontiers in Microbiology 11 (September): 2104. https://doi.org/10.3389/fmicb.2020.02104.